This dataset is shown in Figure 3B of the manuscript.
import stream as st
st.__version__
'1.0'
st.set_figure_params(dpi=80,style='white',figsize=[5.4,4.8],
rc={'image.cmap': 'viridis'})
adata=st.read(file_name='./data_nestorowa2016.tsv.gz',workdir='./stream_result')
Observation names are not unique. To make them unique, call `.obs_names_make_unique`. Variable names are not unique. To make them unique, call `.var_names_make_unique`.
Saving results in: ./stream_result
To load and use 10x Genomics single cell RNA-seq data processed with Cell Ranger:
(The variable index can be reset by choosing a different column ingene.tsv)adata=st.read(file_name='./filtered_gene_bc_matrices/matrix.mtx', file_feature='./filtered_gene_bc_matrices/genes.tsv', file_sample='./filtered_gene_bc_matrices/barcodes.tsv', file_format='mtx',workdir='./stream_result') adata.var.index = adata.var[1].valuesIf the Anndata object is already created, to run STREAM, please simply specify work directory:
st.set_workdir(adata,'./stream_result')
adata.var_names_make_unique()
adata
AnnData object with n_obs × n_vars = 1656 × 40594
uns: 'workdir'
st.add_metadata(adata,file_name='./metadata.tsv')
adata.obs.head()
| label | label_color | |
|---|---|---|
| HSPC_025 | MPP | #eea113 |
| HSPC_031 | HSC | #40bdbd |
| HSPC_037 | HSC | #40bdbd |
| LT-HSC_001 | HSC | #40bdbd |
| HSPC_001 | HSC | #40bdbd |
Alternatively, the step can be divided into two step:
(iffile_nameis not specified, 'unknown' will be genereated for cell label and random color will be used for each label)st.add_cell_labels(adata,file_name='./cell_label.tsv.gz') st.add_cell_colors(adata,file_name='./cell_label_color.tsv.gz')
st.cal_qc(adata,assay='rna')
st.plot_qc(adata,jitter=0.3,)
### histogram plots and log-scale are also supported
st.plot_qc(adata,jitter=0.3,log_scale=[0,1,4,5],hist_plot=[0,1,4,5])
st.filter_cells(adata,min_n_features= 100)
st.filter_features(adata,min_n_cells = 5)
filter cells based on min_n_features after filtering out low-quality cells: 1656 cells, 40594 genes Filter genes based on min_n_cells After filtering out low-expressed genes: 1656 cells, 35077 genes
###Normalize gene expression based on library size
st.normalize(adata,method='lib_size')
###Logarithmize gene expression
st.log_transform(adata)
###Remove mitochondrial genes
st.remove_mt_genes(adata)
Please check if the blue curve fits the points well. If not, please adjust the parameter 'loess_frac' (usually by lowering it) until the blue curve fits well.
st.select_variable_genes(adata,loess_frac=0.01,n_genes=2000)
2000 variable genes are selected
st.select_top_principal_components(adata,feature='var_genes',first_pc=True,n_pc=40)
using top variable genes ... 40 PCs are selected
st.dimension_reduction(adata,method='se',feature='top_pcs',n_components=4,n_neighbors=15,n_jobs=4)
feature top_pcs is being used ... 4 cpus are being used ...
Alternatively, using variable genes as features:
st.dimension_reduction(adata,method='se',feature='var_genes',n_neighbors=15, n_components=4)
st.plot_dimension_reduction(adata,color=['label','Gata1','n_genes'],
show_graph=False,show_text=False)
st.plot_dimension_reduction(adata,color=['label'],
show_graph=False,show_text=False,plotly=True)
n_components>=3 in st.dimension_reduction()¶st.plot_visualization_2D(adata,method='umap',n_neighbors=50,
color=['label','Gata1','n_genes'],use_precomputed=False)
st.dimension_reduction()st.plot_visualization_2D() by setting use_vis=True in st.seed_elastic_principal_graph(). An example can be found herest.seed_elastic_principal_graph(adata,n_clusters=20)
Seeding initial elastic principal graph... Clustering... K-Means clustering ... The number of initial nodes is 20 Calculatng minimum spanning tree... Number of initial branches: 5
st.plot_dimension_reduction(adata,color=['label','Gata1','n_genes'],n_components=3,show_graph=True,show_text=False)
st.plot_branches(adata,show_text=True)
epg_alpha, epg_mu, epg_lambda are the three most influential parameters for learning elastic principal graph.
epg_alpha: penalizes spurious branching events. The larger, the fewer branches the function will learn. (by default, epg_alpha=0.02)epg_mu: penalizes the deviation from harmonic embedding, where harmonicity assumes that each node is the mean of its neighbor nodes. The larger, the more edges the function will use to fit into points(cells) (by default, epg_mu=0.1) epg_lambda: penalizes the total length of edges. The larger, the 'shorter' curves the function will use to fit into points(cells) and the fewer points(cells) the curves will reach. (by default, epg_lambda=0.02)'epg_trimmingradius' can help get rid of noisy points (by defalut
epg_trimmingradius=Inf)
e.g.st.elastic_principal_graph(adata,epg_trimmingradius=0.1)
st.elastic_principal_graph(adata,epg_alpha=0.02,epg_mu=0.05,epg_lambda=0.01)
epg_n_nodes is too small. It is corrected to the initial number of nodes plus incr_n_nodes Learning elastic principal graph... [1] "Constructing tree 1 of 1 / Subset 1 of 1" [1] "Computing EPG with 50 nodes on 1656 points and 4 dimensions" [1] "Using a single core" Nodes = 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 BARCODE ENERGY NNODES NEDGES NRIBS NSTARS NRAYS NRAYS2 MSE MSEP FVE FVEP UE UR URN URN2 URSD 2||50 4.098e-06 50 49 44 2 0 0 1.646e-06 1.402e-06 0.9904 0.9918 2.123e-06 3.295e-07 1.647e-05 0.0008237 0 9.8 sec elapsed [[1]] Number of branches after learning elastic principal graph: 5
st.plot_dimension_reduction(adata,color=['label','Gata1','n_genes'],n_components=3,show_graph=True,show_text=False)
st.plot_branches(adata,show_text=False)
st.optimize_branching(adata,incr_n_nodes=30)
st.plot_dimension_reduction(adata,show_graph=True,show_text=False)
st.plot_branches(adata,show_text=False)
Prune trivial branches:
st.prune_elastic_principal_graph(adata,epg_collapse_mode='EdgesNumber',epg_collapse_par=2)
st.plot_dimension_reduction(adata,show_graph=True,show_text=False)
st.plot_branches(adata,show_text=False)
Shift branching node:
st.shift_branching(adata,epg_shift_mode='NodeDensity',epg_shift_radius=0.1,epg_shift_max=3)
st.plot_dimension_reduction(adata,show_graph=True,show_text=False)
st.plot_branches(adata,show_text=False)
###Extend leaf branch to reach further cells
st.extend_elastic_principal_graph(adata, epg_ext_mode='WeigthedCentroid',epg_ext_par=0.8)
st.plot_dimension_reduction(adata,color=['label'],n_components=3,show_graph=True,show_text=True)
st.plot_branches(adata,show_text=True)
Extending leaves with additional nodes ... Number of branches after extending leaves: 5
st.plot_dimension_reduction(adata,color=['label'],n_components=3,show_graph=True,show_text=True,plotly=True)
st.plot_visualization_2D(adata,n_neighbors=50,color=['label','Gata1','branch_id_alias','S1_pseudotime'],fig_ncol=4)
Importing precomputed umap visualization ...
st.plot_flat_tree(adata,color=['label','branch_id_alias','S3_pseudotime'],
dist_scale=0.5,show_graph=True,show_text=True)
st.plot_stream_sc(adata,root='S1',color=['label','Gata1'],
dist_scale=0.3,show_graph=True,show_text=True)
st.plot_stream(adata,root='S1',color=['label','Gata1'])
Some useful parameters to finetune the appearance of stream plots:
dist_scale: controls the width of STREAM plot branches.factor_num_win: controls the resolution of STREAM plot.preference: controls the order of brancheslog_scale: shows stream plot in log scale to zoom in on thin branchesst.plot_stream(adata,root='S1',color=['label','Gata1'],factor_min_win=1.5)
marker_list defines the list of genes to scan. If not specified, by default it uses all available genes. It might be time-consuming.
Here we only include variable genes.
st.detect_leaf_markers(adata,marker_list=adata.uns['var_genes'],cutoff_zscore=1.0,cutoff_pvalue=0.01,
root='S1',n_jobs=4)
Scanning the specified marker list ... Filtering out markers that are expressed in less than 5 cells ... 4 cpus are being used ... 2000 markers are being scanned ...
adata.uns['leaf_markers_all'].head()
| zscore | H_statistic | H_pvalue | S1S0_pvalue | S0S2_pvalue | S3S4_pvalue | S3S5_pvalue | |
|---|---|---|---|---|---|---|---|
| Mfsd2b | 1.73119 | 971.544 | 2.67993e-210 | 4.68512e-285 | 4.13815e-201 | 1 | 4.26616e-173 |
| Lsp1 | 1.06338 | 956.052 | 6.14683e-207 | 1.80664e-07 | 1 | 9.15373e-218 | 9.25667e-82 |
| Hlf | 1.53252 | 954.588 | 1.27707e-206 | 1 | 7.12118e-66 | 3.88563e-302 | 6.15962e-156 |
| Ly6a | 1.21555 | 947.545 | 4.30465e-205 | 1 | 3.70154e-13 | 8.96542e-295 | 9.74348e-130 |
| Gimap6 | 1.10812 | 932.144 | 9.43503e-202 | 1 | 6.85413e-06 | 6.85064e-262 | 1.02479e-165 |
adata.uns['leaf_markers'][('S3','S5')].head()
| zscore | H_statistic | H_pvalue | S1S0_pvalue | S0S2_pvalue | S3S4_pvalue | S3S5_pvalue | |
|---|---|---|---|---|---|---|---|
| Prtn3 | 1.44966 | 846.577 | 3.42291e-183 | 5.93466e-77 | 3.63307e-87 | 7.50851e-240 | 1 |
| Plac8 | 1.32103 | 597.985 | 2.756e-129 | 7.65062e-132 | 8.06652e-22 | 1.69166e-122 | 1 |
| Tyrobp | 1.49571 | 570.233 | 2.8593e-123 | 2.6061e-104 | 7.59096e-25 | 2.56518e-136 | 1 |
| Mpo | 1.68458 | 528.694 | 2.88406e-114 | 1.23585e-110 | 1.42819e-38 | 1.30874e-128 | 1 |
| Plppr3 | 1.10207 | 525.493 | 1.42456e-113 | 6.02785e-16 | 4.93673e-06 | 2.60273e-99 | 1 |
st.detect_transition_markers(adata,marker_list=adata.uns['var_genes'],cutoff_spearman=0.4,cutoff_logfc=0.25,
root='S1',n_jobs=4)
Scanning the specified marker list ... Importing precomputed scaled marker expression matrix ... 2000 markers are being scanned ...
adata.uns['transition_markers'][('S3','S4')].head()
| stat | logfc | pval | qval | |
|---|---|---|---|---|
| Blvrb | 0.879042 | 0.847545 | 2.370844e-171 | 3.990131e-168 |
| Tmsb4x | -0.873671 | 1.644629 | 1.030916e-166 | 8.675159e-164 |
| Fxyd5 | -0.872509 | 2.515321 | 9.747882e-166 | 5.468562e-163 |
| Car2 | 0.856566 | 0.432449 | 3.010774e-153 | 1.266783e-150 |
| Coro1a | -0.843045 | 2.242194 | 8.708529e-144 | 2.931291e-141 |
st.plot_transition_markers(adata,fig_size=(10,5))
st.detect_de_markers(adata,marker_list=adata.uns['var_genes'],cutoff_zscore=1,cutoff_logfc=0.25,
root='S1',n_jobs=4)
Scanning the specified marker list ... Importing precomputed scaled marker expression matrix ... 2000 markers are being scanned ...
adata.uns['de_markers_greater'][(('S3', 'S4'), ('S3', 'S5'))].head()
| z_score | U | logfc | mean_up | mean_down | pval | qval | |
|---|---|---|---|---|---|---|---|
| Mfsd2b | 19.506093 | 94986.0 | 1.823374 | 0.832106 | 0.234522 | 9.785071e-85 | 6.766377e-82 |
| Car2 | 19.309515 | 94512.0 | 0.912388 | 0.866174 | 0.459796 | 4.485631e-83 | 1.240726e-80 |
| Vamp5 | 18.935553 | 93616.0 | 0.862626 | 0.887242 | 0.487540 | 5.833307e-80 | 7.334057e-78 |
| Gata1 | 18.474235 | 92522.5 | 1.944163 | 0.836230 | 0.216688 | 3.342041e-76 | 3.081362e-74 |
| Slc14a1 | 18.233632 | 91954.0 | 1.694230 | 0.795594 | 0.245304 | 2.802189e-74 | 2.039698e-72 |
adata.uns['de_markers_less'][(('S3', 'S4'), ('S3', 'S5'))].head()
| z_score | U | logfc | mean_up | mean_down | pval | qval | |
|---|---|---|---|---|---|---|---|
| Prtn3 | -19.557227 | 1522.0 | 1.582582 | 0.312801 | 0.938732 | 3.594974e-85 | 4.971850e-82 |
| Hk3 | -19.437617 | 2153.0 | 2.936483 | 0.110801 | 0.853920 | 3.725267e-84 | 1.717348e-81 |
| Tyrobp | -19.414812 | 1899.5 | 2.345545 | 0.176271 | 0.899578 | 5.808346e-84 | 2.008236e-81 |
| Ctsg | -19.270820 | 2203.0 | 1.790200 | 0.270971 | 0.939501 | 9.480967e-83 | 2.185363e-80 |
| Mpo | -19.232438 | 2299.0 | 1.286169 | 0.389208 | 0.950567 | 1.988930e-82 | 3.929558e-80 |
st.plot_de_markers(adata)
st.detect_markers(adata,ident='label',marker_list=adata.uns['var_genes'],cutoff_zscore=1.0,cutoff_pvalue=0.01)
st.write(adata,file_name='stream_result.pkl')
To read back the saved .pkl file
adata = st.read('./stream_result/stream_result.pkl')
st.save_vr_report(adata,ann_list=['label','S1_pseudotime'],gene_list=['Gata1','Epx','Car2'])
st.save_web_report(adata,n_genes=5,
title='Nestorowa, S. et al. 2016',
description='This scRNA-seq dataset contains 1656 cells and 4768 genes from mouse hematopoietic stem and progenitor cell differentiation. A single-cell resolution map of mouse hematopoietic stem and progenitor cell differentiation. Blood 128, e20-31 (2016).',
starting_node='S1',command_used='1.2.STREAM_scRNA-seq (Multifurcation)')
Depending on the desired annotations and genes to be visualized by the user, the following command can be executed to create singlecellVR-compatible .JSON object.
scvr -f stream_result.pkl -t STREAM -a ANNOTATIONS [-g GENES] [-o OUTPUT]